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ABSTRACT 


A linear systolic array (SASP— Systolic Array Signal Processor) for 
DSP applications has been proposed. SASP is divided in two major functional 
blocks - interface unit, and processor array. The bulk of the computing power 
comes from the processor array which consists of a linear array of locally 
interconnected identical processing elements called 'cells'. Data for computation 
is dumped onto the interface unit by the external host. During execution, the 
interface 'pumps' data into the processor array in the required regular pattern 
and generates addresses used by the cells to access their local data memories. 
Each individual cell can be microprogrammed through the interface which itself 
also works under microcode control. A microengine and an address generation 
unit of the interface have been implemented and a simple cell has been made to 
execute some common DSP algorithms. The mapping of algorithms on this linar 

A 

array structure is discussed. 
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Chapter 1 


INTRODUCTION 

S 1.1 COMPUTATIONA. REQUIREMENTS IN DIGITAL SIGNAL PROCESSING ft CURRENT 
PARALLEL ARCHITECTURES 

Very high speed computation is required in a variety of digital signal 
processing applications. The increasing utilization of TV imaging in geophysical, 
medical and industrial environments has led to an increasing demand for high 
speed signal processing. These applications require processing in 'real time' or 
'near real time ' environments, where the data rate of processed images are 
almost the same as that of input images. For example, an image size of 512x512 
pixels and a frame rate of one frame per millisecond gives an input data rate of 
262 Mpixels/sec. In a vision processing system, the job of recognizing an object 
and checking its geometric and physical properties would require an approximate 
processing speed of 

10oos/pixelx5i2x512xi000 = 2621 Mops (Million operations per second) 
CSYKung883. 

Huge amount of computing power is required to support such processing 
demands. But present day technology has almost reached the stage where 
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significant speed-up can no more be squeezed out of a single processor. 
Therefore, fast single processor sequential computers are gradually giving way 
to concurrent computing systems which try to exploit the parallelism, 
simultaneity and pipelinability of the events occurring in a computing process 
[Laz863. 


Modern parallel computers can be classified in three architectural 
categories: pipeline processors, multiprocessor systems, and array processors 
CHwang84]. Structures of the first kind exploit temporal parallelism while the 
second one achieves asynchronous parallelism through a set of interactive 
processors. These two classes of architecture are mainly found in general 
purpose computer domain. The third type of structures find applications as 
special purpose computers intended for applications like signal and image 
processing. 

Although the general purpose supercomputers well live up to their 
commitment of high speed computation (e.g. iOO Mop vector operation in Cray-i; 
Cray-2 claims an 16fold increase in performance; 10 Gop peak performance in 
ETA- 10 etc.), their immense 'number crunching' power may appear elusive in a 
particular area of operation. The reasons can be manifold — general purpose 
computers can not fully exploit the parallelism of all classes of problems 
CDew84] ; moreover, 1/0 bandwidth requirements may stand in the way to improved 
performance. Besides this, the cost of a general purpose supercomputer may not 
justify its use in a narrow, specific field of application. 

Thus, before building a special purpose machine, there should be an 
attempt to meet two major justificationsCStoneSS]-- 
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1) The special purpose system performs a function faster than any 
existing general purpose computer. 

2) The cost of developing the special purpose system should be justifiable 
in spite of its narrow range of application. 

The first justifioation is found in signal processing applications (e.g. 
pattern recognition, speech, sonar, radar, seismic, weather, astronomical, medical 
signal processing), where large number of computations must run in real time to 
support enormous throughput rates. As regards the second justification, recent 
advances in VLSI technology have made it possible to construct large, special 
purpose systems at relatively low cost, since such systems can be composed of 
collections of powerful, general purpose VLSI parts. Extra speed-ups in 
execution for a specific class of problems can be achieved by relating the 
algorithm and architecture more closely and thereby exploiting the 
characteristics of the problem solving method. 

S L2 ARRA/ PROCESSOR ARCHITECTURES 

Array processor architectures are extensively used for special 
purpose computing structures. An array processor uses multiple synchronized 
processing elements (PE's) which operate in a spatially parallel manner. The 
difference between array processors and multiprocessor systems is that the 
PE's in an array processor operate synchronously, but processors in a 
multiprocessor system may operate in asynchronous manner. In this section, 
different array processor architectures will be discussed with a brief review 


of some existing architectures. 
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i> Single Instruction Multiple Data (SIMD) arrays 

An SIMD array is a synchronous array of processing elements (PE's) with local 
memories and local connectivity between them. All the PE's operate under control 
of a central control unit (CU>. Instructions are broadcast to the PE's which 
execute the same instructions on different data sets. Examples of SIMD systems 
are ILLIAC IV, Massively Parallel Processor (MPP), Distributed Array Processor 
(DAP) etc. 

ii) Multiple Instruction Multiple Data (MIMD) arrays 

An MIMD system consists of a number of PE's, each of which has its own data, 
control unit and program. Thus, the overall processing task can be distributed 
over different PE's. MIMD machines are prone to communication bottleneck and 
this factor has affected their popularity in spite of their highly flexible 
structure. They are more suitable to handle irregular algorithms. The dataflow 
machine is an MIMD computer in which an instruction is ready for execution as 
soon as its operands become available. 

A combination of SIMD and MIMD principles is also used in order to have 
a balance between simple communication and architectural flexibility. Examples 
of such systems are PASM, pyramid architectures etc. 

• MPP 

It is a specialized system whose design was motivated by needs to process 
millions of satellite image data quickly. MPP employs a 128x128 array of PE's 
using SIMD mode of parallelism. The data stream is processed by the array unit 
(ARU), while the instructions are processed by the array control unit (ACU). The 
staging memory acts as a buffered I/O path of the ARU and allows transfer of 


data with the outside world in the optimum format. ACU is divided into three 
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parts which allows array arithmetic, scalar arithmetic, and input-output 
operations to take place in parallel [BAT85]. MPP is capable of 32 bit 470 MFlops 
ALU operations (i.e. addition, subtraction & logical operations) and 32 bit 291 
MFlops multiply operations. 

• PASM 

This system is designed mainly to be a research tool for studying the use of 
large scale parallelism in problem domain of image understanding. It is a 
partitionable SIMD/MIMD system which can be structured as one or more 
independent SIMD and/or MIMD machines. Here, an array of N PE's are controlled 
by Q microcontrollers (MC's). Each MC controls N/Q PE's. Thus, a maximum SIMD 
configuration of size N can be obtained by loading all the MC's with the same 
instruction simultaneously. Each PE has its own local memory which can be used 
for data storage in SIMD mode and for both data and instruction storage in MIMD 
mode. MC68000 processors are used as basic processor units. The full PASM 
system is envisaged to have N=i024 and Q = 32. 

• P'j'amid Architecture 

Hierarchically organized and locally interconnected pyramid structures (also 
called processing cones or layered recognition cones) support simple yet highly 
efficient image processing algorithms [Burt841. Pyramids, in general, provide 
successively condensed information (e.g. reduced-resolution, fine-coarse 
approximation of some descriptive information) of the 'raw' images presented at 
the base layers. TItus, meaningless low-level iconic representations of the image 
are organized and converged into higher-level, meaningful, symbolic labels. 

Pyramids have the advantages of parallel processing, pipelining, 
message passing (this reduces interprocessor distances from 0(N) to Odog N) in 
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an NxN array). The pyramid being built by Boeing will have 7 array layers : 
64X64, 32x32, 16x16, 8X8, . . .1x1, which will act in SIMD mode under the 
control of a central controller CUhrSSl. Design considerations for a pyramid 
machine in VLSI are discussed in CDy82]. 

• VLSI array processors 

Remarkable advances in VLSI circuitry have given the impetus to this new area 
of parallel architectures. But VLSI has its own constraints. Components and 
interconnections are made of same 'stuff and hence area is a uniform cost 
measure for both. Time of communication also depend upon the distance between 
the two CLeis833. Thus VLSI architectures stress on requirements like 
communication with nearest neighbours, less interconnection, less supervisory 
control, less I/O bandwidth between PE's and less overheads in communication. 

Host of the signal processing algorithms have some common 
characteristics such as regularity, recursiveness, localized communication etc. 
Consequently VLSI architectures are becoming popular for signal and image 
processing applications. Systolic arrays and wavefront arrays are typical of 
such structures. 

In a systolic array, all the PE's operate in synchronism with a global 
clock while data flows from the main memory in a rhythmic fashion, passing 
through different PE's before it returns to the memory [HTKung821. Systolic 
array architectures are specially useful for compute bound problems, where the 
number of computations is greater than the number of I/O operations performed. 
This is achieved by ensuring that, once a data is accessed from the memory, it 
is used efficiently at each cell it passes while being propagated from cell to 
cell in the array. Wavefront arrays combine the concept of dataflow with 
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systolic computation thus alleviating the requirement of global synchronization 
in a systolic system. 

Although systolic structures are well suited for VLSI implementation, 
high performance systems based on systolic principle have also been made using 
discrete components. 'Warp' processor developd at CMU [HTKung843 is a linear 
systolic array implemented with off-the-shelf IC's. Saxpy Matrix-i is another 
example. 

f 1.3 SYSTOUC ztRRA/S 

In this section we will describe the basic principles of a systolic 
array. A systolic array is a parallel special-purpose architecture made out of a 
number of processing elements of a few types called cells [Quin863. Each cell 
performs some simple operation and the cells are interconnected in some regular 
pattern (e.g. linear, mesh, or hexagonal network). The basic principle of systolic 
computation rests on the fact that by replacing a single processing element with 
an array of cells, a higher computation throughput can be achieved without 
increasing the I/O bandwidth requirement. Following are the basic features of a 
systolic array. 

• Spatial regularity and locality 

The design must be made with finite type of processing elements which are 
locally interconnected. Use of buses for purposes other than clocking is not 
allowed, although it may be useful in certain situations. 

• Tenwaral regularity and synchrony 

Each cell should behave as a finite field automata, in synchronism with a global 
clock. But this does not prevent the cells from executing different programs at 
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different time instants. 

• Pipelinability 

If a systolic array has N cells, it must achieve a linear speed-up of 0 CN). 

• I/O proximity 

All input/output operations should be done with boundary cells only and no 
internal cell can access the outside world directly. Thus in a mesh connected 
array of SI colls, the number of I/O operations must be 0 (4 n ). 

• Modularity 

Systolic arrays are designed for solving a particular class of problem, and the 
array can be extended simply by adding new cells to handle larger size 
problems. Application area can also be widened by adding cells of different 
types to take care of the additional functions that may be required by new 
problems. 

The properties of a systolic array as described above account for a 
number of advantages of such systems. The property of spatial regularity 
divides the time required for developing a system by a regularity factor. 
Locality ensures avoidance of long capacitive wires. Temporal regularity and 
synchrony reduces the number of control lines. Pipelinability provides high 
performance. I/O proximity keeps the I/O bandwidth low. Finally, modularity 
allows design of a system according to the size of the problem. 

f 1.4 OBJECTIk'ES /tHD SCOPE OF THE CURRENT WORK 

In this thesis, an attempt has been made to develop a systolic system 
for executing a class of digital signal processing algorithms. The objectives of 
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the current work are described below. 

i ) To define a systolic architecture for DSP applications which will work 

as an attached processor to an external host. 

ii ) To map some common algorithms onto the suggested architecture. 

iii ) To design and implement an interface cell for the array to support I/O 
requirements of the array as well as to provide independence to the systolic 
array from the host during execution of algorithms. 

iv ) To implement a simplified cell to test the working of the system. 

• Chapter 2 describes the architecture of the Systolic Array Signal Processor 
(SASP) which has been partially implemented in this work. We will follow the 
array configuration that was proposed by Nemawarker CNem883 and concentrate 
on the the architectural details of the functional units. 

• Chapter 3 discusses the general mapping procedures to project conventional 
sequential algorithms onto a specified array structure. 

• Mapping of LU and QR decomposition algorithms on SASP is given in Chapter- 
4. 

• Chapter-5 details the design and implementation aspects of a part of the 
interface cell (remaining aspects are considered in the thesis [Us893). 

• Chaoter-6 deals with the design of a simplified cell and discusses execution 
of a few algorithms on the cell in conjunction with the interface cell. 

• We conclude with Chapter-?, noting down the experience gathered during the 
current work and suggesting some enhancements over the current design. 



Chapter 2 


THE 'SASP' ARCHITECTURE 


f 2.1 INTRODUCTION 

Systolic array structures support high speed computing by effective 
use of a large number of processors. Although systolic systems follow some 
standards regarding interconnection patterns (e.g., localized communication) and 
a few structural features (e.g., pipelining of data and intermediate results), 
details of the architecture depends mainly upon the application requinements. 
This is because such systems are meant to be used as application specific 
processors attached to a general purpose computer. 

Thus, the application needs give an important cue to the identification 
of a suitable architecture, and thus architectural optimization for a narrow set 
of algorithms can be done. Ease of implementation and programming can guide the 
choice. But at the same time, attempts should be made to balance some other 
apparently conflicting factors such as generality, flexibility and efficiency. 

In this chapter, an architecture that attempts to satisfy the above 
requirements is proposed, and its main features are discussed. 
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S 22 SYSTEM /V^CHUECTURE 



FIG. 2.1 SASP machine overview 

The SASP machine (Fig.2.i) consists of two parts - processor-array (PA), and 
the interface unit <!FU). 

i ) Processor-array Main computing power of SASP stems from 

processor-array. It is a linear array of identical processors called cells. Each 
cell is a programmable microengine with a microcoded control unit of its own. 
The microcode for the cell can be downloaded by the host through the broadcast 
bus (BC-bus). 

Each cell is connected to three input channels— X-bus, Y-bus and adr- 
bus ; apart from the BC-bus. Operand data are presented on X-bus and Y-bus. 
Generally X-bus carries the input data which is passed unchanged from one cell 
to the other in a pipelined fashion while Y-bus contains the partial output. Data 
on the Y-bus get updated as it pass through each cell, and final output is 
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obtained from the output of the last cell. 

A local memory in each cell enables storage of data which are local to 
the cell and do not need to be transmitted (a typical example is matrix 
multiplication where each cell contains one column of the matrix, while, row 
elements are passed though the cells. Once the pipeline is full, each cell gets a 
row element in each cycle and multiplies it with corresponding column element 
stored in its local RAM). The address for the local memory also follows a 
regular pattern depending upon transmitted data (e.g., first element of rowl 
multiplies with first element of all the columns). Hence address can also be 
pipelined as data and this is done on adr-bus. 

ii ) Interface unit The interface handles all 1/0 operations between 

the host and the processor array. It receives all the input data from the host at 
a time, and supplies it to the processor array at proper time instants, and 
stores the output of computation. In this capacity the interface satisfies the 
1/0 bandwidth requirement of the high speed array. In addition to this, the 
interface produces the address for the adr-bus and generates control signals 
on the BC-bus for loading data and microprogram into individual cells. For ease 
of description we will refer IFU as cellO, and cells of the processor-array as 
cell i.. to... cell N. The IFU interacts with the host and array in two different 
states — host_state and array_state. These two modes will be discussed in 
detail in the next section. 

S 2.3 STATES OF OPERATION OF THE *SASP^ 

As mentioned in the previous section, the SASP system operates in two 
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different states. 

i ) Host_siate In this state, the interface allows the host to access the 
data memories and microprogram memories of the IFU (cell 0) and the processor 
cells. Host writes data and microcode onto respective cells before handing 
control over to the IFU and procsessor-array. 

i i ) ,4rray_state After data and program has been written during host 

state; execution occurs in array_state . In this state, the host in denied access 
to the SASP machine, and the IFU controls the data flow. Note that, data on the 
Y-bus can flow in either direction. Accordingly, the IFU treats data flow in two 
different configurations. 

a ) Forward mode In this mode, Y-data is fed to the leftmost cell (cell i) and 
output is taken from the output bus of the rightmost cell (cell N). 
b ) Reverse mode In this mode, Y-data is fed to cell N and output is taken 
out from cell i. 

In both the confqigurations, X-data is fed to cell i. Y-data can be circulated 
into the array through the interface. 


f 2.4 INTERCELL COMMUNICATION 

In SASP architecture, global communication requirements are greatly 
reduced. BC-bus is required only during loading of microprogram and data. But 
once execution starts, only local communication channels between cells are used. 
Because of this regular and highly local communication pattern, the processor- 
array can be easily augmented using additional cells. 


Each cell communicates only with its left and right neighbour. 
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Cdmmunication is done through three channels as described above. A one way 
protocol is used for data transfer. Each call has an input queue on each 
channel to buffer the data stream. When ceil^ sends data to celln+i along any of 
the three channels, it also sends the write signal to the corresponding input 
queue of cellrt+i- So it is the responsibility of cell^ to write the data into the 
input queue of cell^^x before operation at can take place. If the data is 

not written into the queue before cell^+i tries to read it, a queue empty signal 
is generated which blocks all operations of cell^+i until the data arrives. This 
blocking is transparent to the operation of the array in the sense, that other 
unblocked cells continue to operate as usual. Similarly blocking can occur while 
writing onto a queue which is already full. The cell sending data would remain 
blocked until a data is read from the full queue by the respective cell, thus, 
providing a free location to write data. 


S 2.5 SIGNALS OH BROADCAST BUS 

The broadcast bus (BC4xis> is used by the host to load data and 
microprogram into individual cells. This bus originates at the interface and 
connects all the cells of the array. Following are the signals that are present 
on the BC-bus. 

# BCDo— BCD 7 These are S-bit data lines which contain data or 

microcode to be loaded to a particular cell. The SASP architecture is proposed 
for 32- bit floating point operation and hence bus width of 32 bits would be most 
desirable. But in the current work, we were constrained by limited board area 
and therefore, we used 8 -bii data buses in oreder to limit the chip count and 
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thereby saving limited board space. 

• BCAq— BCA g They contain a 4-bit address to specify a particular 

cell. Most of the lines of the BC-bus enter the cells through a transparent 
latch which is strobed only when the address on these lines matches the address 
of the cell. At present, a maximum of fourteen cells (=2'^— 2) can be connected to 
the array. One address is reserved for the IFU and one is used as broadcast 
address. A broadcast address is provided to speed up the loading process in 
case all the cells do the same function and therefore, use the same 
microprogram. 

• LjXc This line is made high when a microcode read/write is 

intended. During execution this signal must be made zero. 

• Lw t This line is made high for reading/writing data from/to 

the cells. This should also be made zero during execution. 

• Clr Clears the counter that selects different banks of 

microcode RAMs (liRO, liRi etc. See § 5.6 for details). 

«> Ucs It is asserted each time a microcode loading process 

is started at a particular microcode memory bank. Putting a zero on this line 
puts the microengine of the addressed cell in 'writable control' mode which will 
be discussed in detail in chapters. 

• RST This is a software reset signal used to reset the 

microengine of a particular cell (Chapters). 

• SYNC It is used to trigger all the cells into execution at the 

same system clock cycle in a synchronous manner. 

« Wrlic This is the write pulse to write data or microcode in 

their respective RAMs. 


• Rdiic 


This takes care of reading of data or microcode. 
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• Fig This is a handshake signal for microcode read/write 

and will be explained in f 5.4. 

• ClkB This global clock is given to all the cells and the 

whole array operates in synchronism with this single clock. 


f 2£ FEATURES OF 'SASP' ARCHITECTURE 

The linear array structure of SASP array simplifies interconnections 
and data flow control to a great extent. Because of locality and regularity of 
interconnections, the array is expandable by addition of new cells and no extra 
modification has to be done. For an extremely large size array, global 
synchronization can cause a problem, but with the facility of run time flow 
control described in § 2.4, a switchover to wavefront principle can be easily 
done. Application range of SASP can be widened quite easily. An algorithm, 
currently not supported by SASP because of limited computation power of the 
cells, can be brought under its purview by adding an upgraded cell to the 
existing array and by mapping the algorithm in such a manner, that the special 
computation demanded by the new algorithm is confined to the new cell only. 
Since each cell of the SASP array is programmable, it can also be used as a 
SIMD or MIMD machine to execute non systolic algorithms. 
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MAPPING ALGORITHMS ONTO SYSTOLIC ARRAYS 

I 3.1 INTRODUCTION 

Algorithms are typically written in sequential code. This causes an 
ordering in the program which may not be required at all, as far as the final 
output is concerned. For example, let us consider the mathematical expression to 
compute the addition of two matrices A and B. 

C = A -fB ; Cij = 

The corresponding FORTRAN code can be written as, 

DO 10 j * 1 to N 
DO 10 i = i to N 
C (i,j) = A (i,j) + B (i,j) 

10 CONTINUE 

Although all the elements C (i,,) can be computed simultaneously, the above code 
instructs the sequence of operations in a column major order. This ordering 
feature of sequential codes doesn't consider the parallelism inherent in certain 
operations performed, and therefore, sequential codes are not fit for algorithms 


intended to be executed on a parallel machine. 
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Systolic array is a member of a special class of parallel machine 
known as array processors. Its distinctive characteristics are — local 
communication, balanced distribution of load among processing elements, 
pipelining, regularity, modularity etc. To process an algorithm on a systolic 
array, these features of the machine must be exploited so that, the algorithm can 
be executed efficiently on it. A number of different methods have been proposed 
in this direction to explore the data dependency and parallelism of an algorithm 
Cfioldo83, Sykung87, LeisSSl, and to match it to the required array topology. In 
this chapter, a simple yet quite generalised approach as proposed in [Sykung88] 
will be described through mapping of some common signal processing algorithms. 


S 32 DESCRIPTION OF ^GORITHHS 

This is an important step in an algorithm oriented design. The ultimate 
design should begin with a powerful algorithmic notation that is easy to 
understand and at the same time expresses the recurrence and parallelism 
associated with the algorithm. 

There are many different expressions that can be used to represent a 
parallel algorithm such as snapshots, recursive algorithms with space time 
indices, parallel codes, dependence graphs (DG's) etc. A VLSI algorithm is often 
very regular and the computation activities are conveniently expressed by a 
simple grid model called DG and from the DG representation, the array 
configuration can be easily determined. We will follow the Dependence Graph 


approach here. 
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The first step towards designing the Dependence Graph is to convert 
the given sequential description of the algorithm into a form called single 
assignment code: 

" A single assignment code is a form where every variable is assigned only one 
value during the execution of the algorithm. " C SykungSS 3. 

Example i, LINEAR CONVOLUTION ALGORITHM 
Convolution of two sequences x.; and w- is defined as 

V J 

i i 


y. = 2 * ^j-k = 2 ’^i-k * ^k 

^ k=0 k=0 ‘ 

If u is of length N, and w is of length P than j = 0, 1, 2, 3j 4, , N+P-2 


Note, that in the above expression y^ is overwritten many times and thus 


assigned more than once. The single assignment version of the algorithm becomes 

3.1 


yj*" = + x^‘^ * Wj*^ 


where yj^ = 0 ; 






w. 


'a - '^j-1 

Here k can be viewed as time index and j as space index. But from the 
mathematical point of view, they can be treated equally, and, j and k together 
make the index space of the algorithm. 

Example 2^ AR FILTERING ALGORITHM 


k-l 


0 

w^ =W^ 


It is described by the following relationship, 

N 

Uj = S * yj.l< + Uj 

k=l 

The single assignment form can be given in two ways 

i) * --t * 


ii ) 


where 




yj = 


- a 

i - ®i-l 


'-1 


= ai, and 


u N ^ U. 

“ ^3 


k+i 




N+l 


= u,- 


a k _ k 


*-l 


= aj^ and y^ = y 


3.2 


3.3 


where 
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g 3.3 DEPENDENCE GRAPH <DG> 

A dependence graph ( G = (N,A) ) represents graphically the computations 
that occur in an algorithm, where N is a set of nodes which represent the 
computations and A is a set of arcs representing dependency between 
computations. An arc whose endpoints are the same is called a loop. A chain is a 
sequence of arcs (aj. . . .an) such that arc a^ (2:irin-i) has one endpoint common 
with arc a,_i and its second endpoint common with arc a^+i regardless of the 
direction of the arcs. A path is a chain formed with arcs which are directed 
the same way. A cycle is a path whose endpoints are the same. 

It is to be noted that, a node represents an operation but the details 
are deliberately ignored, A DG can be computable if it does not contain any loop 
or cycle. A shift invariant DG is one in which the dependency arcs connected to 
all nodes are independent of their positions. In the following discussions, all 
the DG's are assumed to be shift invariant unless otherwise specified. However, 
in array design, the 1/0 nodes (boundary nodes) are exempted from this 
requirement . 


5 3.4 MAPPING METHODOLOGY 

§ 3.4.1 DESIGN STEP 1. MAPPING ALGORITHM ONTO DG 

An algorithm is first written in single assignment form. The space-time 
index space of the algorithm corresponds to the lattice space of the DG. The 
data dependencies are shown very clearly in single assignment form, and those 


dependance relations can easily be translated onto DG by connecting 
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corresponding nodes. The DG for convolution algorithm and AR filtering 
algorithm are shown in Fig 3.ia & 3.1 .b respectively. 

§ 3.412 DESIGN STEP 2 : MAPPING DG OtfTO SFG 

Once the DG has been formed, it is possible to design the array by 
assigning a PE to each of the nodes. But this will prove highly inefficient since 
each PE will be operating only for a small fraction of the total computation 
time. To utilize the PE's more efficiently, the DG, therefore, should be mapped in 
a more compact intermediate form called SFG. 

Signal flow graph (SFG) 

This is also a graphical representation of the algorithm concerned, but it gives 
a more concise and specific description of the operations, as compared to DG. 
Unlike DG, SFG has a functional part which specifies operations performed at 
each node. As in DG, the structural part of SFG consists of vertices V 
representing computations, and a set of edges E indicating data dependencies; in 
addition, there one more set D(E) which gives number of delays associated with 
each edge. 

The operations at the nodes are assumed to be performed in zero time 
and any delay in computation is shown explicitly on dependency arcs. This eases 
the analysis of space time activities associated with pipelining. In contrast to 
DG, an SFG can contain loops or cycles as long as there is at least one delay in 
that loop or cycle. 

SFG projection procedure 
This consists of two parts : 

i) Processor assignment ; This step is concerned with the problem of 



Fig. 3 . 1 . a DG FOR CONVOLUTI ON 


Uo Ui U2 U3 

i i i i 


Ua U| 

i I 
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assigning an operation to a particular processor. Since algorithms generally 
implemented on systolic arrays are characterized by high degree of regularity, 
a linear projection for processor assignment is almost sufficient. Bui for 
algorithms which involves irregular data-dependencies, non-linear assignment may 
be necessary. Standard signal processing algorithms are quite regular and 
hence, only linear projection assignment will be treated here. 


In this type of processor assignment, ail the nodes of the DG in a 
straight line are projected to a single node in the resulting SFG. The projection 
direction Is denoted by the vector d. 

ii) Scheduling The scheduling scheme specifies the order in 

which operations are sequenced in all the nodes of the SFG. The scheduled 
execution time of a node is denoted by a time index. A linear schedule, denoted 
by s maps a set of pareallel, uniformly placed, eauitemporal hyperplanes of the 
DG to e set of linearly increased time indices (s is the vector normal to the 
equitemporal hyperplanes). The time index of a node i can be represented by 

-jT 

s . i where i is the index of the node. 


To obtain a valid SFG from a given DG the projection direction d and 


the sohedule vector s have to satisfy two conditions as given below. 

i) Condition for causality : The permissible schedule vector must obey the 
data dependency of the DG, i.e. if node i depends upon the output of node j , then 
3 must be scheduled ahead of i. This can be described by the following 
expression ; 


s . e i 0 for any dependency arc e 3.4 

-4 *4 

ii) Condition for parallelism ; If s and d are orthogonal to each other i.e. if 
d is parallel to the equitemporal hyperplanes, then sequential processing will 
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result. Thus the condition for parallelism can be described mathematically as : 

5 . d > 0 .3.5 


The mapping process from DG to SFG can be summarized mathematically as 
follows. 

-+ -4 

i) Node mapping The mapping of a node 1 of the DG onto a node j 

in the SFG is found by 


-+ T -♦ 

i = P' . i 

“+ -4 

where PisaNx^N—i) matrix orthogonal to d i.e. P’ . d = 0 
It is to be noted that the mapping method maps an N dimensional DG onto an 
(N— i) dimensional SFG. 

ii) Arc mapping : This maps the arcs of the DG to the edges of the 

*4 *4 

SFG and also gives the delay D (e) associated with each edge. For a given arc a, 
corresponding D ( e ) and e can be determined by 


D < e ) 

-4 

e 

iii) I/O mapping : Since, I/O node of the DG do not satisfy the 

condition of shift invariance, special treatment is required for these nodes. The 

^ ^ -4 

SFG node position j and 1/0 time t (i > of an I/O of the DG node i can be 
derived as; 



-4 

r 

t < i ) 


-> 

II 

1 

1 

[ 



Example i SFG FOR CONVOLUTION 

The DG shown in Fig 3.1. a can be projected in several different ways by 
choosing different directions of projection . If we choose to project along j axis 
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-^1 i 

i.e. With d = Q > permissible linear schedule s would be ^ , although there 

can be other valid choices. The processor basis would he j^O i j. 


i) Node mapping : 


1 = [0i] 


- [^] 


ii) Arc mapping : 


D < e ) 


i 1 0 

Oil 


i 2 i 
Oil 


iii) Input mapping : 


t ( i ) 


1 i 
0 i 


i 0 
0 k 


iv) Output mapping ; 


t ( i ) 


0 i 


The resulting SFG is shown in Fig 3. 2 .a. 


Example .2 SFG FOR AR FILTERING ALGORITHM 


d =[10] 


[ 10 ] 


pT - [ 0 1 ] 


i ) Node mapping 


[0 1 ] 


" [^1 


ii) Arc mapping : 


iii) I/P mapping ■ 


i 0 
0 i 


i 0 
0 1 


i 0 i 
0 -i i 


i 0 i 
0 -i i 
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iv) 0/P mapping : 

i 

0 

0 

i 

* 

■ il 

3 

= 

3 

3 





- 


- 


The SFG is shown in Fig 3.2.b, 

S 3A3 DESIGN STEP 3. HAPPING SFG ONTO SYSTOUC ARRAY 

The SFG obtained in steo2 can be realised by a multiprocessor array. 
For an algorithm having a DG of N dimensions, stepC of the mapping procedure 
gives an SFG of dimension N-1. Direct array implementation will require an array 
of the same dimension which again may not prove cost effective. Solution to this 
end can be found by going through step2 repeatedly; thereby, reducing the SFG 
dimension by one in each iteration, until the resulting SFG achieves a feasible 
dimension. This process can be carried out only as long as the SFG remains 
shift invariant. Furthermore, extra contraints different from those mentioned in 
step 2 would also have to be satisfied. This will be demonstrated in Chapter-4, 
with reference to the mapping of LU-decomposition algorithm onto a linear 
systolic array. 

When the SFG achieves the desired dimension, a little retiming is 
required to convert it to a systolic array. The SFG itself represents an array 
structure which differs from the systolic array architecture in the following 
aspects . 

i ) Broadcasting of data is allowed in SFG, but not permitted in systolic 
array. 

ii) Even if the data is spatially localized in an SFG, it may not be temporally 
localized, as in a systolic array. Mathematically, a permissible systolic schedule 
is slightly more demanding than a valid SFG schedule (eq. 3.4 & 3.5). They are 
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listed below: 

-fT -4 -4 

s . e > 0 and s . d > 0 

The need for temporal locality can be alleviated easily by localizing the SFG. 


For example, on the index space <i,j) of the SFG, let a data a^^ is broadcast over 
all nodes having index (i^, j). This can be avoided by writing the variable aj^^ in 
the following manner: 



and 



If j is a spatial index then, the data a^ is localized spatially but not 

■*'1 

temporally, which is not permissible in systolic array. Temporal locality can be 
achieved by introducing additional delay and it has been proved CSYKung883 that 
any SFG can be temporally localized by this method. This method of systolizing 
an SFG is explained below with the example of AR filtering. 


The SFG for AR filtering (Fig 3.2 .b) does not satisfy the condition of 
temporal locality since there are edges with zero delay. It can be temporally 
localized by *cut-set retiming ' which is based on two rules. 

Rule i. Time scaling > All delays can be scaled i.e. D -4 ccD^ by a 

single positive integer which is also known as the pipelining period. 
Correspondingly, inputs and outputs also have to scaled. 

Rule 2. Delay Transfer : For any cut set of the SFG (a cut set 

divides the SFG in two different parts), the edges of the cut set can be grouped 
into inbound and outbound edges depending upon the directions of the edges. 
According to Rule 2, each outbound edge can be advanced by time units while 
each inbound edge can be delayed by the same amount. For a time invariant SFG, 
the general system behaviour is not affected because effects of advances and 


lags cancel each other: 
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Systolization of Fig 3.2 .b is done in two steps. 

i ) Time scaling : Let us scale the delay D by a factor 2 i.e. 

D 2D'' (Fig 3. 3 .a). 

ii ) Delay Transfer ; Applying cuts uniformly along the dotted 

lines in Fig 3.3.a, a delay D^oan be added to each rightgoing edge, and to 
compensate this, a delay D'' can be subtracted from every leftgoing edge. The 
resulting SFG is shown in Fig 3.3.b. 

Now, that the SFG has been temporally localized, it is ready to be 
converted into systolic array. For this, the only requirement is to introduce a 
delay into each of the operating modules. This delay, combined with the zero 
delay operation of a node forms a basic processor element or PE. Remaining 
edge-delays are pure delay elements. Since self loops are implemented as 
registers in the PE, they are also combined into the PE. Fig 3.4 gives a detailed 
illustration of this. 

The systolic arrays for convolution algorithm and AR filtering 
algorithm can now be easily found as shown in Fig 3.5 .a and 3.5 .b respectively. 


5 35 L,4RGE SIZE PROBLEMS 

So far, the discussion of mapping technique has been done with an 
implied assumption that an array has sufficient number of PE's to execute the 
algorithm in a single pass. The advantages of systolic or any kind of parallel 
computing are perceptible only when the problem size is sufficiently large. In 
practice, it is not possible to increase the size of the array beyond a certain 





Fig. 3.3 Systolizati on of SFG for AR-Filtoring 
by , a) Time Scaling 


b) Delay Transfer. 



from a Mode in 
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limit and , therefore, if the problem size is large compared to the physical size 
of the array, the problem should be partitioned into a number of subproblems 
which can be directly mapped onto the given array size. 

Consider a DG having N dimensions and let it be projected onto an SFG 

of dimension N-i and of size SiXSaXSg, Xs^-j.. The SFG is partitioned into 

niXnsXna XnN_i blocks, where each block is of size diXd^Xds . . . . 

-XdN-i. 

Clearly, for i = i .... N-i, d,- = s,-/n,-. 

There are two methods for mapping the SFG onto an array ; 
i ) Coalescent or locally seauential globally parallel (LSGP) method. 

In this method, each block is mapped into a PE. Each PE sequentially 
executes the nodes of the corresponding block, while all the blocks get executed 
parallely in different PE's. Hence, the number of blocks should be equal to the 
number of PE's in the array. This scheme requires that each PE has sufficient 
local memory to store the node data. 

-4 

For a given N dimensional DG, and a projection direction d, let the 

processor basis be P. For an acceptable schedule s, the diXd 2 Xd 5 Xd^si-i 

nodes residing in a block should be executed at different times, so that, at any 

instant, at most one node is active in each block [SYKungSS, Navar873. 

ii) Cat and pile or locally parallel globally sequential (LPGS) method. 

In this scheme, the blocksize is chosen to match the array size and all 
nodes in a block are executed concurrently (locally parallel). After one block 
has been finished, data for the next block is fed, and thus blocks are processed 
in a sequential manner. In this method, the memory requirement of each PE is 
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constant and independent of the size of the problem. 

The LPGS method calls for careful node assignment. Since blocks are 
executed sequentially, they can not have any reverse data dependency between 
each other. Essentially, in this method, the algorithm is partioned by a 
transformation ^ ; where ic® divides the index set into a number of blocks, and 
w determines the sequence of operations inside the blocks CMoldo863. 

A comparative pictorial illustration of LSGP and LPGS methods is given in Fig 
3.6. Besides the above methods, some other techniques also have been proposed . 
They are briefly discussed below. 

i) Dense to band matrix transt'ormation (DBT) CNavar873. 

This method is particulary suitable for matrix computations. Any matrix 
can be viewed as a band matrix of appropriate bandwidth. In band matrix 
approach, a matrix computation can be performed in a single pass through a 
fixed size array, as long as the bandwidth of the matrix is less than or equal to 
the array size. Hence, the size of the problem depends upon the bandwidth of 
the matrix and not upon its dimension. 

In DBT nr^thod a large matrix is first transformed into a band matrix of 
larger dimension, but having a bandwidth equal to the array size. Then the 
matrix can be computed on the fixed size array. 

ii ) Algebraic partitioning 

In this approach, a big problem is first decomposed algebraically into a 
number of smaller subproblems. For example, a 2D-convolution can be expressed 
by sums of products of different i-D convolutions. The details of this scheme 
can be found in [SYKung883. 
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MATRIX COMPUTATIONS ON SYSTOLIC ARRAY 

f 4.1 INTRODUCTION 

Many scientific and technical applications require high computing speed; 
a large number of them involve matrix computation. Systolic array structures 
are quite amenable to matrix computations and this has triggered a host^ of 
activities to find array structures for different matrix algorithms. As a result, 
we have various linear array models to solve triangular linear systems, compute 
convolution of two sequences, perform AR filtering etc. All these algorithms can 
be executed directly on the linear stucture of SASP (Chapter-2). Two 
dimensional models to compute matrix multiplication, LU-deoomposition, QR 
factorization etc. are also available with a number of choices. To implement 
these mesh algorithms on a linear systolic array, careful mapping is required. In 
this chapter, mapping of LU and QR decomposition on a linear systolic array will 
be discussed. Mapping of matrix multiplication is simple and can be done without 
going through systematic design procedures step by step and this will be shown 
later in conjunction with execution of algorithms on a single systolic cell. 
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5 42 M/<PPING OF LU - DECOMPOSITION 

!n LU-decomposition, a matrix A decomposes into an upper triangular and 
a lower triangular matrix. To get an unique factorization, this process can be 
broken into two steps. 

1. A = LDl/ 
where, 

L is a lower triangular matrix with ... . 1^^ = i if i = j & l^^j = 0 if i < j 
t/ is an upper triangular matrix with . . = i if i ~ j Sc u^ =0 if i > 0 

D is a diagonal matrix 

2. A =: LU 
where, 

U = DU^ , which again is an upper triangular matrix. This ensures uniqueness 
of both U and L. 

N 

Now, a^j = ^ l|p dpp where i = l N 8cj=i N 

P i 

but, as per definition, l^j = 0 for i < j and = 0 for i > j 
min(i,j) 


Hence, a^j 

= 2^ ^ip '=‘dp ^^03 


p = i i-i 

for i =s if 

®ii “ '^ii ^ip ^PP ^ P3 


p = i 


i-1 

or, 

^ii " ®ii “ ^ip ^PP ^ P3 


p = i 


and, 


^li = 

®ii ~ '^ii ^ ij ^ip ^PP ^ pj 

p = i 


for i< j 
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OP; 


or, 


for i > i 


or, 


or, 


"13 


^ a ^ ^ij 


i-i 

22 ^iD '^PP 

P = i 


'"ij = ® 


i-1 

1} ~ 2i^ ^Dj ^Dj 

P = i 


'10 


li3 d. 




iU '^PP ^ DO 


P =: i 


lii 


^ij 


iip dpp u pj 


j-i 

Hi - Z 

P = i 
0-1 

'io - 2 ^ ^PO ^PO 

p = i 


/ d 


00 


*^00 


The relations can be 


*io 


= a 


10 


®i0 ®13 


represented by the following recursive algorithm. 

~ ^ik ^kj 


lik = 0 
= 1 

= a^i^ / U|^|^ 


if i< k 
if i = k 
otherwise. 


Uj^j = 0 if 0 < k 

= a^j*' if 0 ^ k 

Before mapping this algorithm onto a systolic array, a single assignment 
description of the above will be helpful. This is obtained as follows. . 
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Single assignment fontn : 
for, k from 1 to N 
i from k to N 

j from k to N 

u,,/ = if i = k 

otherwise. 

if i = k 
otherwise. 

k+l ^ a ^ 1 ^ ij ^ 

where, a^/ = and 

The different steps of mapping are described below. 

Step 1. E>G design 

The DG for this algorithm can be obtained by comparing the index differences 
between each assignment statement. The DG is shown in Fig 4.1. a. 

Step 2. SFG design 

The DG shown in Fig 4.1. a can be projected onto a two dimensional lattice in a 
nptfnber of ways. Choice of a particular direction depends upon the array 
structure on which the algorithm is to be executed. In case of SASP we are 
guided by the following requirements. 

i) All I/O operations should be done through boundary nodes only. 

ii) Any special function (e.g. division, square root) should be confined to a 
particular node rather than being distributed over all the nodes. This enables 




Outputs are 
produced at darkened 
n odes . 



Fig. 4,1. a) DG for LU Decomposition 




•m'r y m. 



Chapter 4 


40 


handling of the special functions by addition of a new cell to the already 
existing array. 



It is found that, for mapping onto SASP, projection along the vector d^= 

i i J appears suitable. Next, we choose the transformation matrix p"^ = 

0 -i 

1 -1 


T -4 

such that P d = 0. 


i) Node mapping •• 


i 

0 


0 

i 



i 


i-k 


i 

k 

= 1 

j-k 


With a schedule vector s"^ = 1 1 1 J, we find the other mappings as 

follows : 

ii) /4rc mapping ■■ 


i 

i i 


i 

1 

o 

o 


1 

I 

i 

0 

0 -i 

1 -i i 

J 


0 

0 

1 0 

0 1 


1 

0 

0 -1 

i -i 


iii) Input mapping ■■ 


iii 


i 


i-fj+i 

1 0 -1 


i 

= 

i-1 

0 i -i 


i 


i-i 


iv) Output mapping ■ 


ill 
10-1 
0 1 -i 



i 

k 

I 

k 

i 


k 

k 


_ 



i+2k j4"2k 
i-k 0 

0 j-k 
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The projected SFG thus obtained is shown in Fig 4.1.b. 

Step 3. Projection of 

To get a linear array, the SFG obtained in step 2. has to be projected once more. 
Before doing this, some additional care has to be taken. 

i) Projection in any direction of the SFG, shown in Fig 4.i.b, would result 
in input operation to interior cells. To avoid this, the index space of the SFG 
has to be extended. This is shown in Fig 4.2 .a. 

ii) Outputs of the SFG, shown in Fig 4.i.a are stored in respective nodes. 
They can be taken out at a time, after execution of the entire algorithm has 
been finished; or they can be outputted intermittently everytime an output 
element gets computed. The second method reduces the bandwidth requirement at 
the output, but it also reduces the scope of pipelining a number of problems of 
the same kind. For taking out the outputs, immediately after they are produced, a 
little enhancement of the current SFG is required which has been shown in Fig 
4.2.a. 


The modified SFG can now be projected onto a linear array which 
satisfies all the requirements. This is done with d^ = ^ = [i i j 

Hence, P^ ~ [ ^ ^ ] 


i> 


Node mapping ■■ 


[1 0 ] 


= [0 


ii) Arc mapping : 


iii) I/O mapping ■■ 


i 

i 

i 

i 


1 

0 

i 

0 




* 


1 

0 

2 


0 -1 
i -1 

[ 2+j 



1 1 -2 
10-1 


Let the delay element introduced in this step is t, which is different from the 
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delay D, already present in the SFG. 

The relationship between D and t is governed by the following constraints : 

i) Processor a^/ailaUty : For this, 

D 2 r + (N-1) ( d ) T 

Where N is the maximum number of nodes along the d direction. This condition 
guarantees that there is no overlap between two activity-instants which are 
separated by a delay of D. 

ii) Data ai/ailibilty ; For this, 

a) 0 D + ( s^. e ) T ^ 0 for all edges 

b) <3D + (s’^. e)T^T for at least one edge in every cycle, 
where e is an edge of the SFG and <3 D is the delay associated with that edge, 
condition a ) ensures causality of operation. 

condition b ) ensures that every operation cycle in the new SFG involves at 
least a single delay. 

In the present case N = 5 so, D ^ 5 t 

We choose D = 5 t 

The new SFG is shown in 4.2.b 
Step 4. Systolizalion of the SFG 

The systolic array, derived from the modified SFG, is shown in Fig 4.2.c. In 
order to fit the SFG to the SASP structure, the rightgoing edges had to be 
multiplexed. Since transmission on these two busses never coincide, multiplexing 
does not introduce any extra delay. Operation of the cells in each clock cycle 
has been shown in table 4.1. The operations of the cells involve loops and 
computation depends upon the indices of the data-i terns. But they are quite 
regular, and control flow can easily be controlled through microprogramme 
residing in each cell. 



TABLE 4.1 
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The mapped array can be used to decompose any matrix of band 3, 
irrespective of its dimension. At a time three matrices can be pipelined; if 
outputs are taken at the end of execution of the whole algorithm. 

§ 43 MAPPING OF QR DECOMPOSITION 

The problem of QR decomposition has recently been given lot of 
attention after the developement of GIVENS SEQUENCES; which are well suited 
for parallel computation. 

In QR decomposition an mxn matrix A (m ^ n) is decomposed into two 

factors A = Q X R by applying a sequence of Givens rotations. Q is a 

f u] 

mX m orthogonal matrix and R = ^ where U is an nXn upper triangular 

matrix. 

T T 

The matrix R can be obtained from the equation R = Q A, where Q 
is a product of rotations, each of which annihilates an element below the main 
diagonal with the property in which zeros once produced are preserved. The 
rotation Q (i;j) eliminates the element a (i.j) of A and this process affects only 
row i and i-1 of A. This is done by the following pre-multiplication. 
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Thus the process of triangularization consists of steps given below -■ 


For, j = i, n 


aj vfr ^ =■ Oi*'' 

1 = m i = 1 


where, annihilates the i ,3 th element of the matrix jl * ^>-1 

pVm 

Order of annihilation of an 8X5 matrix is given below. 


7 

6 8 

5 7 9 

4 6 8 iO 

3 5 7 9 il 

2 4 6 8 iO 

1 3 5 7 9 


Before going through all the steps of mapping procedure, a brief 
description of the algorithm helps understanding of the problem. For this 
purpose let's assume one-to-one mapping of matrix elements onto processing 
elements. The process starts at PE (N-l,i). It fetches a|^ from, PE (N,i) and 
calculates Givens rotation parameters and In the next cycle, these 

parameters are propagated to PE (N-i,2), then to PE (N-1,3) and so on, while each 
of the PE's performs (including PE (N-i,i)) the rotation operation in 
corresponding clock cycle. After PE (N-i,l) finishes the rotation operation, 
PE (N-2,i) can start its turn to calculate and 


The single assignment form of the algorithm for a NxN matrix is given in the 


next page. 
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For, 


where, 


k from i to N-i 

i from N-i to K 


j from K to N 


for i = N-i 
for i N-i 


Cy. s I'-ly. 


^i,k 


Cy 


IJ 


AJCUi/ * Cx,/’ 


for j = k 


C: 


j-i 


i,k 


otherwise 


4,k 




N^Ui/ -H CXi/ 


for 3 = k 


= 


i,k 


j-i 


otherwise 


= 'i,!.’ «:«,/ * 'i,k’ Cki/ 


Mx 


1,3 


■^i,k ^^1,3 '*' “^ik ^^i,3 


MWi.-i = 


Nx 


1,3 ” “1,3 

N-1,3 “ ®N,3 


for, i = i to N-i, 3 = i to N 


and 


for, 3 


i to N 
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Step i. D<3 design 

The DG can be easily formed from the single assignment form. This has 
been shown in Fig 4.3. 

Step 2. SFG design 



2 . 3 ^ =[0 0 1 ] 

i ) Node mapping 


3T=[C0 1] ; 


i 0 D 
0 i 0 


10 0 
0 10 
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i 

ii) 

Arc mapping 
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i 

0 
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-i 
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0 

i 

0 


0 
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0 

1 
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iii) 
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The SFG obtained from this projection is shown in Fig 4.4.b. 

Step 3. Projection of SFG 

The SFG shown in Fig 4,4 .a can be mapped onto a linear array using one 
more projection along d"'^ = [ 0 i ] with 0 i j. This has been shown In 

Fig 4. 5. a. Preloading of inputs to interior cells can be avoided by modifying the 
SFG as shown in Fig 4.5.b. 

Step 4. Systolization 

Finally, from the modified SFG, the systolic array can be easily derived 
as shown in Fig 4.5 .c. 


S 4.4 EIGEN 1^/<LUES OF A SYMMETRIC MATRIX 

The QR decomposition algorithm can be used to find the eigen values of 
a symmetric matrix. In this method, a matrix A (nXn) is first converted into a 
tridiagonal form by a series of orthogonal transformation. Then a series of QR 
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algorithms are repeatedly applied to reduce the subdiagonal elements to zero 
CAlgSSl. The QR algorithm is described below, 

A = Q * R and R = 51 A hence, R*Q=Q^*A*Q 

Thus R I Q is similar to A, where A is a symmetric matrix. Therefore, repeated 
application of such similarity transforms can be used to find out the eigen 
values of the original matrix. 

Ir t the QR decomposition described in previous section, we carried out only the 

pre-multiplication by a series of orthonormai matrices Qi In addition to this, 

T 

QR algorithm also requires post-multiplication by . The post-multiplication 

operation will effect elements of column i and i-1, and since A is symmetric, the 
(j,i) th element will be annihiliated as a result of this. It is to be noted that for 
a symmetric matrix, the informations regarding the elements above the main 
diagonal are redundant and, therefore, can be ignored. 

The eigenvalue problem can be done in two steps. 

i ) T ridiagon&lization 

In this step matrix A is trasformed into a band matrix of bandwidth 2 by applying 
the QR algorithm. This is described below. 

For, j = 1, n 

^ Aj = f[ ai*A*QiT 

i = n i = i 

i 

where, annihilates the i,j th element of the matrix TT 

D = n 

ii) Reduction of subdiagonal elements 

Once the matrix has been tridiagonalized, the QR algorithm can be repeatedly 
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applied to reduce the subdiagonal elements to zero. This can be stated 
mathematically as. 

For, i = i, , n 

=. IT '• = f[ Qi * A * q/ 

i « j-4-2 i » 1 

i 

where, annihilates the i,j th element of the matrix IT Qpj * 

p = j + 2 

The algorithm described above can be efficiently executed on a linear array. In 
fact, use of a two-dimensional mesh array will prove inefficient, since, the PE's 
outside the band will be active only during tridiagonaltion, and therefore, remain 
idle for most of the execution time. The DG and SFG would be almost the same as 
those designed for QR decomposition, since data dependencies are quite similar 
in both the oases. A systolic array derived from the SFG shown in Fig 4.4 .b is 
shown in (Fig 4.6). The details of cell operations will not be discussed here. 
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IMPLEMENTATION OF INTERFACE CONTROLLER 

§5.1 IHTROBUCTION 

In Chapter 2-4 the SASP architecture and mapping of algorithms on this 
linear array have been described. Now we will discuss about realization of the 
system on hardware. From the overview of the architecture given in Chapter-2, 
we see that the SASP array consists of two kinds of basic functional units — 
interface and cell. Once these two units are developed, the systolic array can 
be realized by connecting an array of replicated cells to the interface. In the 
current work, it was decided to build the interface first,— the rationale being 
that once a proper interface is made, one can have an array of different 
degrees of versatility using processing elements of varying complexity. A cell 
can be anything from a simple MAC unit to a sophisticated computational unit 
depending upon applications, whereas the functions of the interface are more 
clearly defined and hence it can be made with a design that needs modification 
less frequently. 

In this chapter, the design and implementation of a part of the 
interface is discussed. The SASP machine is recommended for 32-bit floating 
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point operations, but for implementing a 32-bit machine, chip count becomes 
prohibitive considering the PCB-layout facility available here. So the prototype 
was attempted with 8-bit wide data bus and even then, to accommodate a large 
number of chips we used wire-wrapping technique instead of going for PCS 
mounting. Before discussing the design aspects, it is necessary to have a 
detailed identification of the functions of the interface. 


S 5.2 FUNCTIONS OF THE INTERFACE 

The systolic array requires input data at a particular rate determined 
by the algorithm to be executed, and it gives the output data at a rate which is 
also specific to the problem. During execution of large size problems on a fixed 
small size array, intermediate results are generated, which have to be fed back 
again into the processor-array until the final output is available. 

All these data handling operations cannot be done by the host with 
which the systolic array is used as an attached processor. The reason is; the 
supply rate and reception rate of data have to match the speed of operation of 
the array. Furthermore, it is more desirable to keep the array operation totally 
transparent to the host operation. 

Moreover, as shown in the SASP architecture (Fig. 2.i), the adr-bus is 
used to pump address patterns through the cells., The address sequence has to 
be generated at the interface. Thus the interface should act as a competent 
source for all interconnection channels (X-bus, Y-bus, adr-bus), as well as a 
suitable sink for intermediate results and output. 
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There is one more function which the interface is required to perform 
— loading of data and microprogram into each cell. As discussed in Chapter-2, 
all the cells are connected to a global BC-bus, which is used to dump 
microprogram and resident data in respective cells. The control of this BC-bus 
rests with the interface. A list of jobs for the interface is given below. 

• Host-state 

i ) Receiving data from the host and sending output to the host, 

i i ) To direct proper sequence of controls for loading data and microcode 
into individual cells. 

i i i ) To take over control of the array from the host before execution 
starts. 

• /irray-state 

i ) To supply X and Y data to the array at required rate. 

i i ) To route data according to the configuration of the array (forward or 

reverse). 

i i i ) Receiving intermediate results and looping them back into the array, 
i V ) Receiving and storing output results . 
v) Generation of address for the adr-bus. 

vi ) Returning control back to the host through an interrupt when execution 
is over. 


§ 53 BLOCK DIAGRAM DESCRIPTION OF THE INTERFACE 


The block diagram of the interface is shown in Fig 5.i 




ADDRESS 


Fig. 5.1 Block Diagram of the Interface 
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The interface can be functionally divided into four functional units. 

1. Host interface unit (BUFFER!, BLFFER2, XCEIVER, DECODER) 

The host interface unit buffers the address, data, and control lines 
coming from the host. The decoder selects proper blocks (e.g. CKTRL REG's, XIU, 
YIU) for writing or reading of data. The host considered here, is a PC -XT and 
SASP is mapped on its I/O address space. The control lines going to the host 
are Ready and Interrupt signals. 

2. Data interface unit (XIU, YRJ, X-bar, MX, EWX) 

Before asking the systolic array to start operation, the host dumps the 
data onto the RAMs of the data interface unit, wherefrom these data are 
transmitted from cell to cell. XIU contains X-data which, during operation, is fed 
to cell!. YIU sends Y-data from RAM Y , stores output results in RAM Y ; and 

A A 

receives and recirculates intermediate results through RAM Y and RAM Y^^. For 
handling intermediate results, two RAMs have been used, so that simultaneous 
read and write can be supported by accessing both of them at the same time (i.e. 
writing into one RAM and reading from the other). X-bar provides appropriate 
routing of data channels for forward and reverse mode of operation (f 2.3), and 
MX, DMX are used to send/receive control signals to/from appropriate cell in 
these modes. 

3. Host side control (HANDSHAKE & SYNC, CNTRL REG's) 

The host controls the interface operations during data and 
fnicpoprogram loading through setting some bit of the CNTRL. REG's, 


which are 
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mapped as output ports on the host address space. Data transfer between host 
and interface is done by handshaking. As the PC initiates a read or write cycle 
meant for the interface, the Ready line goes low, the HANDSHAKE & SYNC Lr»it 
generates local read or write pulse, and after completion of the required 
operation. Ready line is asserted to indicate the end of operation. 

4. j4rray side control and address generation unit (IFCU, AGU) 

The IFCU takes over control of all the blocks after host has finished 
loading of microcode and data, and supervises all kinds of routing and exchange 
of data during execution of an algorithm. After the execution is finished, it 
generates an interrupt which is passed to the host through the host interface 
unit, asking the host to take necessary action. The AGU section generates the 
address pattern for the adr-bus. 

Although access of data by processor array generally occurs at 
regular intervals, the regularity can be disturbed for large-size problems where 
the fixed (small) size of the array requires partitioning of the main problem. 
This indicates that tracking of instants of data access, by event counting 
through up/down counters, lacks generality. For this reason it was decided to 
put the interface under microcode control which increases the versatility of the 
unit. 


The AGU is functionally equivalent to a self contained ALU and is 
capable of generating complicated data dependent address patterns under the 
control of a rich instruction set. The instructions come from the microprogram, 
and thus the process of address generation has a close coupling with microcode 
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sequencing. 

In the following section, implementation of the IFCU and AGU, as done in this 
project, will be discussed. 


§ 5.4 DESIGNING 'IFCU' & 'AGU' 

8 5.4.1 INTERF,^E CONTROL UNIT (IFCU) 

The IFCU has been implemented as a programmable microengine. At this 
point it would be relevant to discuss the principle and some of the advantages 
of microcode control in a nutshell. 

It is a standard practice to use microprocessors to control data flow 
and to perform computations in a 'smart' circuit. But the inherent sequential 
nature of microprocessor operation prevents its use in high throughput 
machines, where more functional parallelism is required. 

This parallelism can be achieved by using microcode approach. The main 
difference between microprocessor circuits and microcode circuits is that the 
functional units integrated in a microprocessor are spread out as separate 
building blocks in a microcoded circuit, so that they can operate simultaneously 
and independently. 

In order to coordinate the independently operating devices, these 
functional blocks are operated in tandem in synchronism with a common system 
clock. Control instructions for all the devices are put together in a central 
microcode memory. Instructions are stored in each location of the memory, and a 
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memory location is accessed in every system clock cycle. The instruction can be 
group of bits for VLSI devices or can be single bit control signals for SSI or 
MSI devices. 

In a high performance system, mere seauential addressing of microcodes is not 
sufficient. It is desirable to have a sophisticated program flow, that 
accommodates nested loops, subroutines, interrupts etc. Such demanding 
seQuencing requirements are met by using a program sequencer. Like other 
functional units, the sequencer also gets its instructions from the microcode 
memory and generates the address for the next instruction depending upon the 
current instruction. 

ADSP-i40i program sequencer has been used as the microprogram 
controller. The preliminary block diagram for the IFCU and the AGU has been 
shown below (Fig-5.2). 



FIG. 5.2 Preliminary block diagram of 'IFCU' & 'AGU' 
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The pipeline register is used to latch the control signals for the functional 
blocks. Since the microcode memory is accessed in each system clock cycle, data 
from the memory is unstable for some time during memory access. This can cause 
uncertainty in the operation of the devices controlled by the microccxJe. 
Pipeline register prevents this by latching data only when it is stable. This 
function will be clear from the following timing diagram (Fig 5.3). 



FIG. 5.3 Pipeline latch timing 
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S 5A2 ADDRESS GENERATION UNIT <AGU) 

The interface has to generate the address on the adr-bus which flows 
systolically from cell to cell. The ADSP-i410 address generator has been used 
for flexible address generation. The ADSP-14iO's architecture features a i6-bit 
ALU, a comparator, and 30 i6-bit registers. In a single cycle the device can 
output a i6-bit address, modify this address, and can do conditional loop back to 
the top of a circular buffer. Consequently it supports modulo addressing without 
any overhead. 1410 can do bit reversal at the output, which is very important 
for FFT algorithms. 

ff 5.43 DOUNLOADING OF MICROCODE 

Generally, microprogram of a system is burned into a ROM. But in the 
present case, function of the IFCU is determined by the algorithm to be executed 
and the microcode for the control unit changes with different algorithms. Hence, 
a flexible controller needs to be reconfigured dynamically. This can be achieved 
by downloading the microprogram from the host, before starting operation. 

ADSP-1401 has a special instruction (WCS) to support this operation. In 
this configuration (Fig 5.4), microprograms are loaded in RAMs and additional 
circuitry is added to support download. 

WCS instruction : 

The WCS instruction puts the sequencer in a mode in which sequential addresses 
are output every cycle and instructions are ignored. The Flag input is used for 
handshaking. When the WCS instruction (20 hex) is presented to the instruction 
port of the sequencer with the starting address (for microcode RAM) presented 
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to the data port, the instruction (20 Hex) is latched with the rising edge of the 
next clock cycle and subsequent instructions are ignored. 

After receiving this instruction, the program sequencer outputs the 
starting address presented at the data port and waits for the Flag input to be 
high. Once the Flag is high, the sequencer increments the address and outputs 
that in the next clock cycle. The Flag input is used for handshaking and can be 
made high permanently for fast devices. 

The download mode can be terminated through an interrupt or reset. In 
present circuit, termination has been done by external reset. 


PRCXSiRAH address 


WCS. 


host write 
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FIG. 5.4 Circuit for downloading microprogram 
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In Fig 5.4, the WCS instruction and the starting address are given through 
hardware. For this, a control bit called Ucs is made logic low and the buffers 
containing the WCS instruction and starting address are enabled, and they are 
put on appropriate buses. With this the sequencer goes into WCS mode. The write 
signal from the host is synchronized to generate the Flag input. 

After the downloading has been finished, the host resets the sequencer, 
clearing WCS and causing a jump to location 0 (the first location of the 
microprogram). 

8 5.4.4 DATA TRANSFER BETWEEN 'IFCU' & 'AGU' 

The ADSP-i40i and ADSP-i410 has a i6-bit data port each. Allocation of 
two different microcode fields to these ports would have made the arrangement 
most flexible, but it would cause inefficient utilization of the memory space. 

A reasonably flexible scheme for data transfer among microcode memory, 
address generator and program sequencer is given in Fig 5.5 .b. The output and 
input arrangement of ADSP-140i and 1410 permits data to be output during clock 
phase one, while inputting of data is performed in clock phase 2, thus allowing 
reading and writing of data in single clock cycle (Fig. 5.5.a). 


The circuit of Fig 5.5 .b allows following data transfers in a single clock cycle. 

• DF-* 1401 or DF -4 1410 ( Ken = 1, Den = Dstb = 0) : Required for 

initialization of 1401 & 1410 registers. Supplies 1401 with direct, indirect & 




Fig. 5.5.b) Scheme fo 


r Data Transfer between IbCU and AGU 
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relative program addresses, and 1410 with immediate data addresses or offsets. 
• 1410 1410 ( Ken = 0, Den = Dsib = 1) : It allows 1410 to save its 
registers in 1410's stack during a subroutine call, and allows 1401 to use the 
1410 as an ALU to calculate program address. 

Timing diagram of these data transfers are shown in Fig 5.6. 
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FIG . 5 . 6 Data transfer timing 







Chapter 5 


69 


5 55 CIRCUIT DESCRIPTION 

It has already been mentioned in Chapter-2, that we treat the interface 
as a special cell (cellQ). In line with this configuration, the rOJ gets its 
signals from the BC-bus (| 2.5) when the cell address ( BCAo—BCAa) matches the 
address of cellQ. 

IFCU has eight 8k X8 RAMs for storing microprogram and therefore, 
each microcode word is 64-bit wide. The host controls the states of the control 
unit through a number of control bits which come from a set of control 
registers (CHTRL REG's). These control bits are listed below (a brief outline of 
their functions has already been given in Chapter-2; more details will be given 
along with the circuit description.): 

i. Lite 2. Sync 3. Clr 4. Rst 5. Wes 

6. Host/ array 7 . BCA0-BCA3 

Furthermore, for writing and reading (microcodes need not be read as they are 
loaded by the host only. But this option has been kept to facilitate testing) of 
microcode, IFCU uses the following signals from the Handshake & SYNC block. 

1. Write 2. Rdlic 3. Fig 

Circuit diagram and timing for these signals are given in Fig. 5.7. 
Before going through it, it is necessary to mention two points about data 
interface. Firstly, since the Handshake & SYNC unit generates local read and 
write signals, the data from IFCU to host is latched with a signal called Syior 
(Fig. 5.7). Secondly, all data transfers on the BC-bus are treated by the host as 
if it is being done with a single I/O port i.e. when the host addresses a 
particular port (i.e. Csiich*Q), only then the write and read signals appear on 
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the BC-bus. 

§ 5 . 5.1 MICROCODE LOADING 

Schematic diagram of the microcode loading scheme is given in Fig. 5.8. 
Most of the control signals come through a transparent latch (LTHl), while the 
remaining control signals and data are adequately buffered (BFi, BF 2 ). The 
BUFF -Bank consists of eight bidirectional buffers. Each of them are connected 
to one RAM of the RAM-Bank (it is an array of eight RAMs labeled as hRq-hRj). 
A counter and decoder combination selects one pair of RAM and buffer at a time 
while loading the microcode. But during microprogram execution, all the buffers 
are disabled and all the RAMs are enabled. It is to be noted that Lwt = i for 
microcode loading and = 0 for microcode execution. 

The microcode is loaded in a column major order. As mentioned before, 
the microcode reside in an array of 8 bytes of columns and 8i< bytes of rows. 
Each RAM holds a column of microprogram and at a time one column is loaded. 
Loading starts with with flRo and after the writing is finished, the sequencer is 
reset only to start another sequence of WCS cycles to write onto the next RAM. 
Each time the counter (CTR) is incremented to select the proper RAM. 

In the following part, the sequence of operations that occur while loading a 
microprogram, is listed, 
i. At power 141 : 

• F=i ; Output of LTHl is tri-stated. Rdjtic & Wrlic are pulled 

high. F is set to zero by a software command. 


2 . 


Loading of microprogram : 




ISTB 
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starting address (0000) to the sequencer, remove reset. 

• Lite = Wes *i, Rst=!Clr = 0 ; Wes is removed and the sequencer 

waits for the Flag input. 

• Lite = Clr = Wes = i ; RstsO ; Clear to CTR is removed and the host 

can go on writing onto itRo. 

After loading of iiRo is finished, writing should start at ilRi. The 
required steps for this are listed below. 

• Litc = Clr = Rst = i; Wcs=0 ; Reset the sequencer. 

• Ly.c = Clr = i; Rst = Wcs = 0 ; Give WCS instruction. 

• Lite = Clr = Wes = 1 ; Rst = 0 ; Wes is removed, CTR is incremented 

and hence itRl is selected. 

Thus, switching from one RAM to other is done and microcodes are written into 
each of them. 

5 SJ5.2 EXECUTION OF MICROCODE 

Once the microprogram has been loaded, the host directs the IFCU to 
start execution. This can be done by the following sequence after the last 
instruction in iU.R 7 has been written. 

• Liic=0; Clr=Wcs = Rst=i ; Reset the sequencer. Enable all the 

RAMs and disable all the buffers in BUFF-Bank. All the RAMS are 
permanently ^et to read mode. 

• LMc = Rst=0; Sync=Clr=Wcs=l ; Remove the reset and start execution. 
Note that, before doing the last step, Host /array must be made logic low. This 
blocks host control signals to all data control blocks, and connects microcode 
instruction bits to those units (e.g. XIU, YIU). 
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According to systolic principle, all the cells should act in synchronism. 
To ensure this, execution of microcodes in all the cells should start at the same 
clock. This is done in the following manner. After loading microprogram in a cell 
(including cello), the host puts that cell in microcode execution mode. According 
to the ADSP-i40i specification, the first instruction of any microcode has to be 
a simple CONT (continue). Now, in every microprogram, a conditional jump 
statement is deliberately given as the second instruction, which keeps the 
sequencer waiting for an external condition. The condition nothing but the Sync 
hit from the CNTRL Reg's (Fig 5.1). It is fed to the Flag input through the Chtc, 
and is kept low by the host until all the operations to be done in the host-state 
finishes. Once all the cells are ready with necessary data and microcode, the 
host makes the Sync bit set to logic 1. In the next clock cycle all the cells 
which had been waiting so long for the external condition starts execution in 
unison. 

S 5.5.3 USE OF MICROCODE BITS 

The microcode bits are used to control different blocks as shown in 
Fig. 5.1. The main functions performed through microcode control are as follows. 
(Signals written in the brackets are used for the corresponding function.) : 

1. Sending data to next cell (liRdx, ilRdya, llRt^jjj) 

The main function of the microcode is to send X or Y data to appropriate cell 
according to the array configuration. It has to send a write signal along with 
the data, to be written into the input queue of the destination cell. The reading 
and writing of data should be over in a single clock cycle. This can be done by 
latching the read data as shown in Fig 5.9. The read signal (e.g. URdx) is 
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directly sent to the destination cell which generates its write signal by gating 
the read signal with its own clock. 

2. Setting the configuration of the array 

As mentioned earlier , the SASP can operate in two configurations — forward and 
backward. The difference between them lies in the connection of Y-bus only and 
switching between these two configurations can easily be done by properly 
adjusting the X-bar , Mx and DHx. Control bits for these units also comes from 
the microcode and hence the configuration can be changed dynamically during 
program execution. 

3. Reporting the end of execution. 

Once the interface has sent all the data required for execution, it anticipates 
the end of computation and goes on checking the End-op condition. Since the last 
valid computation can be done by any cell, depending upon the size and type of 
the problem, End-op is made an open collector input which is connected to all 
the cells. The cell, last to finish computation, raises the line to logic one, and 
after detecting that the control unit sends an interrupt to the host. 

In the present work, we have tried to apply microcode control in the above 
mentioned cases. The versatility of the interface can be enhanced by putting 
many new functions on it. It would result in an increase in number of functional 
blocks with very little control overhead. This is because, in a microcode circuit, 
extra control can easily be provided by increasing the microcode width. 
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CELL FOR 'SASP' 


ir 6.1 ItfTRODUCTlON 

Systolic arrays are characterized by rhythmic data flow through an 
array of processing elements which have a high degree of structural and 
functional similarity. The power of systolic computation arises from regular and 
systematic use and transfer of data by various processing elements (PE's) called 
'cell's. In the systolic array discussed here, each cell has two data channels 
both at the input and the output, and an address channel as shown below (Fig. 
6 . 1 ). 



FIG. 6.1 A Systolic Cell 
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The computation perf ormed by each cell is of the form 


where comes from the local memory specified by the address on the 
adr-bus. This simple cell configuration supports implementation of many useful 
algorithms, but to get a versatile systolic array computational capacity of the 
cell must be enhanced, In this chapter, a highly flexible cell architecture will be 
suggested (the of the cell design can be found in [Us89]), and the implementation 
of a simple cell will be discussed. 


S 6.2 CELL FOR 'SASP' 

The data path of the cell is shown in Fig. 6.2. The cell is controlled by 
a horizontal microengine and all the datapaths are 32-bit wide, which can 
support 32-bit floating point operations. The cell consists of a queue for each 
interconnection channel (XQ, YQ & AQ), two Muxes for X and Y channels (XMx, 
YMx), two memory banks for resident and temporary data (Hi, M2), a 32-bit 
floating point computing unit (ALU, MUD, a register file pair for the floating 
point units (RALU RMUL), and a local address generator (LAGU). All internal 
datapaths are connected through a completely general crossbar. 
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FIG. 6 ■ 2 Block diagram of the cell 


Following is a description of the major blocks of the cell. 

• Arithmetic Units (ALU, MUL) 

They can be implemented with ADSP-3201 and ADSP-3202 32-bit floating point 
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chipsets. These chips have one level of pipelining with minimum cycle time of 40 
ns and a minimum latency of two clock cycles. 

« Data storage units (Ml, M2, RALU, RMUL) 

The local data memory Ml is used for storing constants, intermediate data, etc. 
and thereby reducing I/O bandwidth requirement of each cell. By using a local 
memory for storing intermediate results, a cell can be multiplexed to implement 
the function of multiple cells and the linear array can be used to implement 
algorithms designed for two dimensional systolic arrays. The backup memory M2 
increases the memory bandwidth. Addresses for both the memories come from the 
address crossbar. Multiport 64x32 register files (RALU, RMUL) are used for 
buffering data for ALU and MUL. These data storage components increase the 
computational bandwidth by permitting flexible data routing through their five 
ports. ADSP 3128 register files can be used for this purpose. 

# Crossbar (AX-Bar, DX-Bar) 

The arithmetic units can process four data items and output two data items per 
computation cycle. To support this high data processing rate, flexible 
interconnection among data processing units, data storage units, and I/O units 
are reauired. A general purpose crossbar (DX-Bar) serves this purpose. It has 
five input ports and seven output ports. The crossbar can be reconfigured 
every clock cycle under the control of microcode. The address crossbar (Ax- Bar) 
is used to select any one of the two address sources (i.e. external address bus 
and LAGU) and supplies addresses to the two data memories (Ml & M2). 

• Input Queues (XGL YQi adrQ) 

A fall through queue on each of the input channels increases the intercell 
bandwidth and also ensures that data and address are properly synchronized at 
the time of operation. Standard 512X8 FIFO chips (e.g. CY7C412) can be used for 



Chapter 6 


8i 


this purpose . 

• Input multiplex BT (YMx) and XLth 

These are used to support various inodes of array configuration. 

a. ) ForwRrd mode : in this mode, both X-data and Y-data items flow in the 

same direction i,e. from left to right. 

b. ) Rei/erse mode : In this mode , X and Y travels in opposite directions. 

While X flows from left to right, Y flows from right to left. This bi-directional 
data flow is required for many systolic algorithms (e.g. LU-Decomposition). 

c. ) UrBp'-around mode* To handle large size problems, the algorithm has to be 
partitioned properly. In one partitioning method called LSGP scheduling 
(Chap-4)| a single cell is multiplexed to perform the function of several cells . 
This requires output of a cell be fed back to the inputs which is supported in 
this mode. The XLth is used to hold X-data after it has been taken out from the 
XQ. This allows use of X-data a number of times once it has been transmitted to 
the cell. 

# Local address generation unit (LAGU) and microengine. 

This unit is the same as the one used in the interface cell and has already been 
discussed extensively in Chapter-5. The local address generator is used for 
data dependent addressing which is local to a particular cell. 


I 6.3 DESIGN OF A SIHPUFIED CELL 

Although the goal of this thesis was to design and implement a part of 
the interface unit for the systolic array, a small cell has been added to test 
the data transfer and data handling activities of the interface. For this 
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purpose, we have chosen three algorithms (matrix multiplication, linear 
convolution and AR filtering) for implementation, and a minimal cell structure 
was designed to execute these algorithms. The basic computational unit used 
here is a multiply-accumulator (MAC) chip; in addition, the cell has a local RAM 
of 8K capacity to store resident input data. Other inputs for computation come 
from either X or Y bus. For controlling the operation of the cell, a few bits of 
the microprogram residing on the interface have been used. All data paths of 
the cell are 8-bit wide and support only 8-bit operations. 


5 6.3.1 BLOCK DIAGRAM DESCRIPTION OF THE CEU. 

Block diagram o-f the cell is given in Fig. 6.3. 

Following is a description of each block. 

• Input latches (XLth, YLth, 0/PLth) 

These are used to latch the X and Y data coming from the interface and o/p data 
being returned to the interface. These latches introduce one stage of pipelining 
at each input and output. 

• Buffers ( adrBf, DBf, 0/PBf ) 

Address and data are buffered wherever they are taken from or put onto the 
external bus. The DBf is used to buffer data on the global BC-bus and is used to 


load the local memory Mi. 
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FIG. 6 ■ 3 Block diagram of the simplified cell 
• Data multiplexers ( IMx, OMx) 

One input of the MAC can be chosen from either X or Y bus by selecting the 
proper IMx channel. The OMx gives a loop around facility for Y data input which 
can be recovered unchanged even if the MAC operates on the Y data. This 
feature is required in AR filtering problem where an output item is used to 
compute next N outputs and hence it has to be retained for those many cycles by 
looping it around. 
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• /Arithmetic unit (MAC) 

The multiply-accumulator does the following operation on its nth operation 
cycle : 


= On + lift + 12ft . 

Oft can be the output of the operation done in the (n-i)th cycle, or it can be 
preloaded with Y data in the nth cycle itself. ADSP-i008A MAC chip has been 
used for arithmetic operation. 

• Data storage ( M) 

An 8KX8 RAM has been used to store one set of inputs (e.g., system coefficients 
for AR filtering). The memory is preloaded by the interface prior to cell 
operation and is read during execution. In both cases it is addressed via the 
adr-bus by the address generator of the interface. 


ff 6.3.2 HARDU/iRE DESCRIPTION OF THE SIMPLIFIED CELL 

The schematic diagram of the cell is given in Fig. 6.4. The ^-marked 
signals are bits from a microcode field of the interface . Clk is the system 
clock while ClkB is another clock having twice the frequency of Clk. The phase 
relation between Clk and ClkS is given in Fig. 6.6. Before starting execution, the 
local memory (M) has to be loaded with the data write facility of the interface. 
The address of the memory comes from the adr-bus which is the output of the 
interface AGU. To provide address through AGU, a small microprogram is first 
loaded and executed on the interface. The program initializes a register <R;) of 
the 1410 and the 1410 outputs the contents of the register on the adr-bus. 
Simultaneously, the sequencer (1401) comes to a state where it waits for the Fig 




Fig. 6.4 Schematic Diagram of the Simplified Cell 
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condition. Now, if the host has executed a data write into the cell (in this cycle 
a write operation is done on an 1/0 port called mmc with the control bits & 

lWc=0), data sent by the host will be written into 'M' at an address equal to the 
content of R;. After the write has been finished, Fig to the sequencer is 
asserted and the sequencer moves over to the next microcode instruction where 
the 1410 modifies its R.-.The same process is repeated until the host finishes 
writing into 'M'. Flow chart of this process is shown in Fig. 6.5. 


After finishing data loading in the manner described above, other data 
RAMs on the interface are written. Finally the microprogram, to execute the 
algorithm, is loaded and control is transferred to the array by making 
Host/array = Lu>t = LUc = 0. The MAC has two control lines iAcc, Sub) to specify 
the operation to be performed. These lines, alongwith the input data lines (11, 12), 
are latched into the MAC at the second phase of the clock (Clk) and hence MAC 
operation is performed in the second phase only. The output of the operation is 
available at around the end of the clock cycle and it is latched into 0/PLth at 
the rising edge of the next clock cycle during which the output is read by the 
interface. The Acc & Sub lines can be used to define the MAC operation as 
listed below. 


Acc Sub 

i ) 1 1 

i i ) 1 0 

i i i ) 0 X 


MAC operation 
0(„+i) = X„ * Y„ - 0„ 

= Xn * Yn + On 

0(n4i) — Xn ^ Yft 


As mentioned before. On can be preloaded during the first phase of the 
nth cycle, while operation on On is performed in the second phase. During 
preloading, the Prel input of the MAC should be held high, output bus should be 
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tri-stated (Tsl * high) and und«r this condition data at the output port is 

preloadad at the rising edge of the ClkP. The data on the Y-bus can be 
preloaded through BUFFi in this design. For preloading microcode control bit 
Prl is made high. Other microcode bits are used to specify MAC operation and to 
select the IMx and 0/PMx channels. Timing diagram of the operations are given 
in Fig. 6.6. 


S 6.4 EXECUTION OF ALGORITHMS ON THE CELL 

The cell has been designed to execute three algorithms- matrix 
multiplication, linear convolution and AR filtering. The mapping methodology 
discussed in Chapter-3 can be used to map these algorithms on this single cell 
systolic array (!). But in case of a single call structure the data dependencies 
of these algorithms are easily understandable and it is not really necessary to 
go through the trouble of projection, scheduling, scaling etc. The input and 
output sequence can be found directly through observation and are tabulated in 
Tables 6.i, 6.2, & 6.3. 

It is to be noted that since a single cell has been used, the output is 
produced after sequential execution of a number of intermediate stages. Those 
intermediate results may have to be looped back into the cell through Yg while 
the outputs are fed into Y^. 

% 

With a single cell, obviously the advantages of parallel processing are 
not achieved but the merits of systolic data flow can still be felt. From the 
tables we see that the number of clock cycles required for execution is almost 



r 
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equal to the number of arithmetic operations (multiply or multiply-add). I/O 
overload is almost negligible. 

Microprograms for execution of the algorithms are given in appendix-2. 
The tight pipelining of 1/0 operation was not maintained in practice because of 
the following reasons. 

The interface has been designed with the assumption that it should be 
connected to intelligent cells with their own control units. Accordingly a 
blocking scheme has been implemented, which blocks the clock of the appropriate 
cell in case of any overflow or underflow of data buffering units during 
intercell communication. Since the rest of the array continue with execution, 
normalcy is soon restored and the blocked cell is brought back into operation. 
But in present case, there is only one controller controlling the interface and 
the cell. If it goes into blocked state due to overflow or underflow of the Yg 
RAM, there is no way of recovery from that state. Therefore, the operation 
cycles have been sparsed sufficiently to avoid data contention. 


S 6.5 A SOFTUARE UTILITY TO USE 'SASP' 

To put SASP into service, the host has to initialize it with data and 
microprograms. These operations require frequent manipulation of control 
registers, communication with different 1/0 ports etc. To carry out these 
operations with simple commands, a program has been developed. This program 
enables the user to communicate with all the I/O ports in different access 
modes, e.g., single byte 1/0, multiple 1/0 to transfer string of data, transfer of 
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files etc. These functions are executed at user commands followed by suitable 
command-tails depending upon the function. The details of the commands and 
functions performed by them have been listed in Appendix-S. 
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SUMMARY AND CONCLUSIONS 

Systolic arrays combine the characteristics of array processing and pipelining 
in a regular pattern of interconnected nodes. Suitable algorithms are mapped on 
the array in such a way that the data dependencies of the computations match 
the interconnection pattern of the cells. 

A systolic array signal processor (SASP) has been described. It is a 
linear array of microprogrammed cells connected to an external host through an 
interface cell which deals with all the data pumping and receiving operations. 
The host communicates with individual cells through the interface for loading 
microcode and data. 

An interface cell has been designed and part of it has been implemented 
on hardware, A small cell has also been made and a few algorithms have been 
executed on this simple prototype. The interface can be accessed through a PC- 
XT and a program has been written for loading data and microcode onto the 
interface and for reading output from it. 


To execute an algorithm efficiently on an array structure, it should be 
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mapped properly so that its data dependencies match the array interconnection 
pattern. Before settling on a design, a detailed study of the algorithms to be 
executed on the particular structure should be carried out to have a clear 
understanding of the different kinds of data routing that may be required 
during execution. This is to ensure that the final design supports a flexible 
data path which takes care of all' the data dependencies. Although 'systolic 
description' of a handful of algorithms are readily available, different array 
structures are proposed for different types of algorithms. LU and QR 
decomposition algorithms, which are generally described on a mesh or hexagonal 
array have been mapped on linear array structure and this has helped to 
understand the demands on data passing mechanism. We feel that a more detailed 
and comprehensive study is required on the algorithm side for a more versatile 
design. 

The SASP architecture has the attributes of a versatile computing 
machine. Because of time constraints we attempted to develop a simplified 
version using readily available simple LSI and MSI components which could not 
exploit all the features of the advanced IC's like 1401 and 1410 ; the flexibility 
of architecture has also not been fully implemented. Prime motivation of the 
current work was to have some experience with this architecture and also to 
check the feasibility and viability of such systolic array processor. To extract 
the full computing capability from a SASP type architecture we will require an 
enhancement of the current design with advanced high speed chips. A cycle time 
of iOOns can be achieved (this gives a throughout of ION MFlops where N is the 
number of cells), whereas the present circuit with LS chips operate at a much 
reduced clock of 1 MHz only. Further increase in speed is possible with a 
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switchover to wavefront principle. 

Software aids are required for programming the SASP. An optimizing 
compiler can hide the low-level details of the machine and allow the user to 
concentrate more on the parallelism at the array level. The W2 language 
developed for the CMU Warp processor [Annr873 is an example of such software 
support. But attempts in this direction can be made only after the system details 
are fully defined. This area has been beyond the scope of the current work and 
can be taken up as a future enhancement of the SASP system. 
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registers, denoted B„,. and can execute in i single-cycle the core 
OiyrraiJon: 

Y-<-R„;R«-a-R„4B„. 

In DSP applications, data arrays arc often addressed as circular 
buffers. That is, when addressing reaches the buffer end, it 
wraps back to the beginning of the buffer. To implement this 
looping, the AG compares the supplied address to one of four 
compare registers, denoted Q. If the address has moved to or 
beyond the end of the boundary (R„>Q), the device can 
transfer an initialization register value, denoted I,, to the register 
(I^d"^* Ii)j otherwise, it is updated in normal fashion 

Bm). To minimize overhead, the AG can execute 
normal updates while also performing conditional rc-iniuilizaDons; 
again, in one core operation: 

R,; IF (R„>Ci): R„-<- l^; ELSE 

Since the above instruction handles the looping required of 
circular buffer addressing, it is termed a loopmg instruction. To 
a large extent, the ADSP-1410‘s architccrurc and instruction set 
revolve around cfTicicnt implcmcnution of this instruction. 
However, many variations of this instruction arc supponed on 
the device and spelled out in the following sections. 
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ADSP-HIO PIN ASSIGNMENTS 


PIN NAME 

DESCRIPTION 

0 
>- 

1 

TTic address (Y) output port. In single-chip 'double- 
precision mode, the MSB (Y 15 ) indicates whether 
the supplied address is the MSW or LSW (see 
Precision Modes). In two-chip double-precision 
mode, the MSB conveys the carryVshifi bit from 
the Least Significant (LS) to the Most Significant 
(MS) chip. 

o 

0 

1 

d 

The bi-directional data (D) port. In two-chip 'dou- 
ble-precision addressing mode, the MSB (Djs) of 
this port conveys CMP status from the partner 
chip. 

U - lo 

The instruction pon. 

CMP/2 

A dual function pin. Looping instructions, which 
compare address register values to compare 
register values, assert this pin HI to convey 

CMP status if i) R^C for positive offsets, or 
ii) R-^C for negative offsets. Logical- Shift in- 
structions assert this pin HI to convey the ZERO 
status of the result. 

DSEL 

Data Select control. Assening this control HI 
causes data set up on the data pon to subsiitutc 
for the R value specified in the instruction. 

AIR Enable 

Alicrnaic Instruction Register control. Asserting 
this control HI causes the device to execute an 
insiniction stored in the internal AIR, rather 
than the instruction set up on the instruction 


pon. 

CLK 

ClcKk 

Vou 

■f 5 Volt Power Supply 

GKD 

Ground 


3 ■^e MlCMOCOniD SUPPORT COMPONENTS 



r 

INI* 


T*nnu^*ti > 


tK»N 



A|| ure i A DS P^1401 Block Dictgrdm 


ADDRlS^tNG MOr^lJ 

Wrfi'l hifh ilw^hitr tml rrlMtivc 
Imijsfni. frinii mirnisil RAM 

HARDM'ARf, I'f’AI'l’RKS 

f»ort 

BnlifriUoMl t%u fun 

fmt thimx Muliiplcx^f 

Thfff StAik FuiiUm 

Four livcril C‘ou«ttn 

G^nditton 

Fjihf ffionii/fd iruJ MsK^^’thfe Uicr luterrupu 

TIK Pm- 
Inp 

Rcvt 

INStRUCriON 1'VPFS 

Jumrn utnl Hfan* hrf. 

0|»<r4»jt>ni 

Sf^tu\ Rcgi^trf f)iM‘raU<>fn 
C^nuHff C>|tff4tH»n^ 

Inirtrupt ('.unioA 

R<*l4U've Adilrcw USdth Gtmrot$ 

If>i.trtH rion Hold 
WiHMWf Gifiirol %twt; 

C'ouiurr HudfiPtm I/Mtffwpt 

Dfdkjiird Overflow Iriffrrupi 


ADSM401 PIN ASSIGNMENTS 


Pin Name 

U-h 


Y,j-Yo 

D,5-Do 

EXIR4., 

CLK 

FLAG 

1TR 

VpD 

GND 


Description 

The 7-bii microinsinjction controlling the 
ADSP-140L 

Ouipm bus which provides addrcsscsto the micro- 
program memory. 

Bidirccoonal Data bus for transferring data to or 
from the ADSP-140L 

Four external interrupt request lines. Note that in- 
ternal circuitry' supports 8 interrupts w'ith the aid of 
an external 2 to 1 muliiplcxcr. 

External clock input 

An input used for conditional instructions. Its 
source is usually a condition multiplexer. 

A multi-purpose pin accommodating traps, output 
disable and reset. 

4 5 Volt supply. 

Ground. 
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fill titm »r»d tipatittmt mfiwjrrmcDt, wr t*,n drirrminc ihtt 
the iKSik current tn h driver it 

where AV/At n the tninil \U^ r*ie. 

In the i:*vr uf the progr«n\ vr^urrMcr, for in rxicrrul load capaci- 
tirne of ^Oph arul a for^^urfd %\cv. f»tf of 0 6V/ns, the peak 
cuircnt Will Ik* aUuH ^ChuA Sukc ihrrr are 16 such drivers, the 
total I'cak (lirrmt nuv upproa^h 480mA’ 


4.S Mnemonic* >nd OjHHHie* 

Oixtnic hits "u" M-irvi the fclevani register (Ej .o) and'or counter 
(C) o) OjHtHic hits '\c** vrlrct the condition to bc applied: 

W UNCONOinONAL 
Xl'F NOT FLAG 
»10" FLAG 
nr SIGN 

The SKJN conditfon ii precluded from mstnictioni prefixed 

with *’*’'*. 


The internal ground and supply lines may undergo a large dis* 
Turbance during this transition unless the ADSP-1401 is tied to 
I solid ground plane and gcxxl high frequency decoupling is 
used (0.1 pF ceramic between GKD and Vi>d is close as possible 
to the device). Otherwise, is it possible that internal dm in the 
ADSP-1401 may bc lost. 



Status Kepstrr Bit AsstiimoeaU 

Bit# 

Function (HILO) 


IKrWTtlL’Bir” 

SRi 

IR<,Mask Bit 

SRj.4 

ReUiive Jump Width Selection; 

*00' 16-bii relative address width 

•or -t-bU width 

*10’ •• IHC Mc«d< (8-bit width) 

*1 !**► 12-bii width 

SR» 

Select GSP/LSP 

SR, 

Enablc'Disablc Interrupts 

SR, 

SeLOca.r Sijp Bit 

SRo 

Scica Traijsparcnt'Latchcd Interrupts 



IL 


!!♦*#<( 

f H InsimiMM*#' 

|K m 

<KH 0101 

D II All JtiMP K: 

m HI 

Oil (HOI 

U NDI M AC. JUMP PC 

tiwi/i 

JIW'O 

101 mOI 

IMDKD JUMP Kii 2 

IDA 

m 

D t'DNt» Jl MP DATA, 
ahmh im 

Itm 

m mO! 

D i tlND JUMP DATA, 

H H A t IVI 

JDt 

101 It 10 

U tuND JUMP DATA, 
JNDIHHT 

jtmvr 

100 n n 

D VKiN DP L JUMP DATA. 
«.V - R., PI 1 

*|IH 

n 0 * t ♦ » 

11 ( oKD JUMP R, 

m 

no ID 1 

II MiiN DP JUMP R„ 

^U' 1 

lU 

m t » fK> 

D (DSD Jl MP SUB, 

AHSDf ITIi 

1%K 

m 10 

11 i ttSD JUMP SUB, 

HPI AUVP 

H rn 

ffti tt n 

D ( dND K1 rURN FROM 
SUB 

*HKA\r H 

HR' tut 

If SH,N 01 (., JUMP R,’, 
JISI.C..- -C, DIP COND 


|rMl‘ DATA 


nm% 001 1110 

mst? oil mo 

wKssp 000 mo 

010 IHKi 

insp r»o</0oio 

s.r,sf‘ (XK. um 

Msr 000 01 It 

010 mt 
WkWSP (KM'I llW 

OH* Out I 
PV/H' £KK) 0101 

PHH4' OltKi 

PVrmS 001 ml 

ppHMt on mi 

MRAP CM 0 1 Oil 

MU\r tHRi mi 

S4HSP on IHKt 


H'sn DATA ONTO S5. 
ntf SS TO fiATA K)kT 

«'Hrrf %sp 

po-adssp 

OH,kI MI vt SSP 


AH n r osp 
AM.n 1 i.sf* 

HI AD HSP 

la'Hiri 

pi AH H ONTO K8 
Pt'AH (AP ONTO %% 

IMP (AP 1 k<»M .SA 
PI AH DATA DNTO HS 
IMP HA li) DATA K>RT 
ADI» I TO KSP 
Al HI PACT I f ROM KSP 
At HlHAt I 4 I ROM KSP 


Sutuii KeidHcr Op<r»6oii«: 


RDSR 

010 mo 

W'RSR 

001 1100 

PSSR 

010 0001 

PPSR 

0)0 0010 

Counter Operationa: 

VCKCNIR 

oil loii 

CLRS 

001 0)00 

st/rs 

on 0100 

PSCMTR 

000 lOii 

PKINTR 

001 lOii 

DCCKTR 

oil OOii 

IFCDEC 

101 ccOO 

Interrupt C*ontrol: 

CCIR 

001 OOOl 

CAIR 

OOO 0001 

RTNTR 

000 00 n 

RDJV 

010 noi 

WRIV 

OOO 1 101 


IRMBC 

001 

oon 

IRMBS 

001 

0010 

DISIR 

001 

ono 

EKAIR 

on 

ono 

SLIR 

001 

oin 

sriR 

on 

oni 

SLRIVP 

001 

noi 


HEAD SR 
WK \ : E SR 
PUSH SR ONTO SS 
PAtP SR FROM SS 


WRITE C, 

CLEAR SIGN BIT 

SET SIGN BIT 

PUSH C. ONTO SS 

POP C FROM SS 

DECREMENT C 

IF COND: DECREMENT Q 


CLEAR CURRENT 
INTERRUPT 

CLEAR ALL INTERRUPrS 
RETURN FROM 
INTERRUPT 

READ INTERRUPT VECTOR 
AND INCREMENTTAT 
WRITE INTERRUPT 
VECTOR AND INCREMENT 
IVP 

IR mask bitwise CLEAR 
IR MASK BH-WISE SET 
DISABLE INTERRUPTS 

enable interrupts 

SELECT LATCHED 

interrupts 

SELECT TRANSPARENT 

interrupts 

URITE SLR-. -D^.i AND 
1VP-. - Di> , * 


elativ« Ad<lnc4,i Width Controls: 

SELECT 16 Bn RELATIVE 

addressing 

SKI.ECT i:-BlT RELATIVE 
addressing 
SELECT U-BIT RELATIVE 


RELH* 

010 0)00 

RKI.12 

010 oni 

REh* 

ovo ono 




Mivcellanifouv InUructions: 


CDNT 

000 0000 

IDLE 

001 0000 

IHC 

010 0101 

wes 

010 0000 


CA)NT1NUE 

IDLE 

enable INSTRUCTION 
Hin.D CONT'ROL 
WRITE CONTROL STORE 
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overview ot adsp 1410 


D,v. 



Figure 1- AOSP- MTO Functional Block Diagram 
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Figum »0 Ckjck Cycle Time vs Temperature 


Figure 1 1. Typical loo vs. Frequency of Operation 
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Instr. Opcode (If.o) Description 


Looping Instructions 

YlNC^t* lOllccrrrr 
YDECn: lOlOccrrrr 
YADD'f: llccbblrrr 
YSUB^t: llccbbOrrr 

Regisier Transfer Instructions 

YRTR*: OOOlOlrrrr 

YRTB*: OOllbbrrrr 

YRTC*: OOlOccrrrr 

DTI: 00001 1 1 lii 

ITR: lOOOiirrrr 

BTR: OlOObbrrrr 

RTD; OOOlOOrrrr 
errO: OOOOllOOcc 

BTD; 0000 n 0 1 bb 
ITD; OOOOniOii 

Ugical and Shift Instructions 

YOR*t' Ombbrrrr 

YAND*t'* OnObbrrrr 

YXemn: OlOlbbrrrr 
YASR*t: OOOUlrrrr 
YLSI/t: OOOllOrrrr 
Control Register Instructions 
0 000 000001 
oboofTTTio 
0000101 1 1 1 
OOOOlOOi i X 
OOOOlOlOpp 
OOOOOlOOlx 
OOOOOl lOlx 
000001 lOOx 

00000101 lx 

OOOOOlOlOx 


output & incrcrntni/init 
output decremeni/inil 
output & add ofTscl/inii 

output & subtract ofTscl/inii 

output & xfrRtoR 

outputs xfr Rio B 

output &xfrRioC 

xfrDioI 

xfrltoR 

xfr B 10 R 

xfr R to D 

xfrCioD 

xfr B 10 D 

xfr 1 to D 

output & OR B with^io R 

output & AND BwitVtoR 
output &XORBwith^toR 

output & arith SRR to R 
output & logical SL R to R 

reset CR 
xfrDioCR 

xfrCRioD 

set cond rc-init on CMP mode 
set chip precision 
set Y port to trafns latched mode 
select upper. lower R bank 
select upper.lowcr B bank 
set post^pre update mode 
set cond AIR mode 


RST: 

DTCR: 

CRTD: 

SETI: 

SHTP: 

SHPY: 

SHLR: 

SELB: 

SHTU: 

SHTA; 


air Instructions 

WRA: 0000101100 

RDA; 0000 1 0 1 101 

LDA: 000001 1 1 10 


H rite AIR with D 
read AIR at D 

loadAIRonncxtcycle 


^Qp. 0000000000 noopcration 


MICROCODEDSUPPOFIT COMPONENTS 



MlfRcK-ODE FOR MATRIX MULTIPLICATION 


on on on no 72 7n no 60 6a 00 fe 62 00 fe 6a 
ff 0 ^ ns no no no ££ ££ ff ob 06 ff 01 05 ff 
ff on no 00 80 80 ££ ££ ££ 00 00 ££ 01 00 ££ 

00 20 34 30 40 40 40 bO bO bO 00 30 c6 20 00 
40 20 20 20 on 00 01 01 01 00 00 00 00 00 00 
78 88 88 88 88 00 80 8a 8a 8a 88 88 88 88 88 
£0 £0 £0 £0 £0 £0 £0 £0 £2 £2 £8 £0 £0 £0 £0 


MICROCODE FOR CONVOLUTION 


00 no 72 70 no no 60 OO 62 fe 70 00 00 60 £e 00 00 00 6a 
ff 00 on 00 ££ t( ff 07 ff it 04 00 £f ff ££ Of £f ff ff ££ 
ff 00 no on fi ff a on tt it oo oo ff ff ff oo ff ff tt tt 

20 14 14 14 14 14 u 14 14 14 bO 14 14 14 14 14 14 14 14 14 
20 20 00 0‘j D*. 08 Od Od 09 08 08 OS 05 05 05 05 01 00 00 00 
88 as as 88 «o 80 no ao ao 88 ss so ao so ss so so so so so 

fO £0 fO £0 £3 £b £3 £3 f3 £0 fO fO £3 fb fb fb fb £8 £8 fO 


MICROCODE FOR AR-FILTERING 


00 on on nn no 70 IZ oo oo 72 62 tt 60 tt 6a eO 
ff 40 41 ff no 08 00 ff ff 02 ff Oa ft 07 ff 07 
ff 00 00 00 00 80 ao tt tt eo ff oo ff oo tt oo 

00 14 14 28 30 00 00 40 14 b8 14 bS 00 40 00 00 
40 20 60 ftO aO 00 00 81 44 80 04 88 08 80 00 00 
SO 88 88 88 88 88 08 88 88 88 88 88 88 88 88 SS 
£0 £0 <0 tn to to fO fO fO e2 £2 c2 £8 £0 fO £0 


B, row i I t.pr the code contents of RAM! of the microcode 

tl”barik. 



Appendix - 3 
Manual for using SASP 


fiAnr 

Ext^cu*r' JJAUr to '.tMi t ht% fromot SA£P> , Services provided under 
tt-is:* I'f. •!•>(. » 4 r .» below Some of the commands are 

follaw»*d u'-vf.fii <ind . l ,14 I i containms a number of arguments 

I juic'd th 0 i.t and separated by commas. To get a 

oarticulaf s.# r ; a i» . th((« ct'f rtj spend i ng comriiand should be given 

with Dr*i.)P#j ,if H Command and individual arguments 

Should b* #»naf 4 t*d tr-om each other by a blank. 

i . 0 <iKW't mwM*, d.it*i{oi>t tonal )> 

Output* ont* t-yt# to th« port In case no new data is given, the previous 
data 4* cutout. 

7. 1 <m>rt m»mr> 

int'Utt trrm port and displays it on the screen. 

3, 1,0 <i:wj'rt rupm, ciata, niimh»r> 

(iM.i ><:, tt<« POT'i continuously for a number of iterations specified in 
ltrt> rH4n^H*r uvld of the command-tail, An infinite loop can be implemented by 
fivinf only ih# first two arguments and omitting the number. 



4. LI <port number > 

Inputs 6-bwte d<ita from the port continuously for a number of iterations 
SDi‘cificd by the mmbtH' Infirutf Ifwos can be made by omittina the number 
argument 

5. FLQAD <i>ort na/ne, full path description of the file> 

Loads the file specified in the path into the port. 

Clears all the counters which generate local addresses for the data RAMs. 

7. WCS 

Givti a Ucs instruction to th® sequencer, 

8. SUM 

Switch®* from on® microcode RAM to the other by giving a Wes instruction to 
the *eau«ncer (see Chap-S), 

9. START 

Signals th® sequencer to start execution, by setting the control and status 
registers, 

LIST QF PORTS 

RO, Rl, R2 -♦ Control registers. MMC -+ Microcode RAM. 

MX 4<-RAM. MWT -» Data RAM of the cell. 


MYA. MYBl -♦ Y-RAMs. 



